library(mosaic)
library(readr)
library(Stat2Data)
logit = function(B0, B1, x) {
  exp(B0 + B1*x) / (1 + exp(B0 + B1*x))
}
NBA_Data = read_csv("/Users/reidbrown/Documents/Senior/Spring 2020/STOR 455/Homework/Data For HW6/nba.games.stats.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Team = col_character(),
##   Date = col_date(format = ""),
##   Home = col_character(),
##   Opponent = col_character(),
##   WINorLOSS = col_character()
## )
## See spec(...) for full column specifications.
NBA_Data
Hornets_Data = NBA_Data[NBA_Data$Team =="CHO",]
Hornets_Data

Data Prep

I found this data set on Kaggle.com. It is called “NBA Team Game Stats from 2014 to 2018” and was collected by Ionas Kelepouris. It was last updated two years ago, so the data is not current, but it is still recent. I want to see how different variables predict the Charlotte Hornets’ likelihood of winning games. I downloaded the data, read it into R, and sliced out just the Hornets (abbreviated as CHO). The analysis for Homework 6 is done below.

#Use If/Else Statement to recode as a dummy variable
Hornets_Data$Win = ifelse(Hornets_Data$WINorLOSS == "W",1,0)
head(Hornets_Data)
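As a quick optional sanity check, cross-tabulating the new dummy against the original column confirms the recoding worked:

```r
# Every "L" game should land in Win = 0 and every "W" game in Win = 1
table(Hornets_Data$Win, Hornets_Data$WINorLOSS)
```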

Part 1, A

#Choose a single quantitative predictor and construct a logistic regression model
Hornet_Mod_PartA=glm(Win~TeamPoints,family=binomial,data=Hornets_Data)
summary(Hornet_Mod_PartA)
## 
## Call:
## glm(formula = Win ~ TeamPoints, family = binomial, data = Hornets_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9517  -0.9198  -0.3558   0.9926   2.4020  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.37856    1.35162  -7.679 1.61e-14 ***
## TeamPoints    0.09936    0.01302   7.633 2.30e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 453.23  on 327  degrees of freedom
## Residual deviance: 367.27  on 326  degrees of freedom
## AIC: 371.27
## 
## Number of Fisher Scoring iterations: 4

Part 1, B

#Plot the raw data and the logistic curve on the same axes
plot(Win~TeamPoints, data=Hornets_Data)

B0 = coef(Hornet_Mod_PartA)[1]  # estimated intercept
B1 = coef(Hornet_Mod_PartA)[2]  # estimated slope for TeamPoints

curve(exp(B0+B1*x)/(1+exp(B0+B1*x)),add=TRUE, col="red")
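Note that the logit() helper defined at the top of the document computes the same fitted probability, so as a small equivalent sketch the curve could also be drawn with it:

```r
# Equivalent: reuse the logit() helper to draw the fitted logistic curve
curve(logit(B0, B1, x), add = TRUE, col = "red")
```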

Part 1, C

# Construct an empirical logit plot and comment on the linearity of the data
for (i in 5:15) {
  emplogitplot1(Win ~ TeamPoints, data = Hornets_Data, ngroups = i, main = paste(i, "Groups"))
}

Based on the output of all of the empirical logit plots I generated, there appears to be a strong, positive, linear relationship between the points the Hornets scored and the log odds of a win.

Part 1, D

H0: β1=0

Ha: β1≠0

#Use the summary of your logistic model to perform a hypothesis test to determine if there is significant evidence of a relationship between the response and predictor variable. State your hypotheses and conclusion

summary(Hornet_Mod_PartA)
## 
## Call:
## glm(formula = Win ~ TeamPoints, family = binomial, data = Hornets_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9517  -0.9198  -0.3558   0.9926   2.4020  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.37856    1.35162  -7.679 1.61e-14 ***
## TeamPoints    0.09936    0.01302   7.633 2.30e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 453.23  on 327  degrees of freedom
## Residual deviance: 367.27  on 326  degrees of freedom
## AIC: 371.27
## 
## Number of Fisher Scoring iterations: 4

Looking at the p-value for TeamPoints in Hornet_Mod_PartA, we see that it is very small (p = 2.30e-14), which tells us that TeamPoints is a significant predictor in this model. Because this p-value is < 0.05, we reject the null hypothesis and conclude there is a relationship between TeamPoints and the odds of winning.

Part 1, E

#Construct a confidence interval for the odds ratio and include a sentence interpreting the interval in the context
# exp() converts the CI for the log(odds ratio) into a CI for the odds ratio
exp(confint(Hornet_Mod_PartA))
## Waiting for profiling to be done...
##                    2.5 %      97.5 %
## (Intercept) 1.917493e-06 0.000388143
## TeamPoints  1.077931e+00 1.134497376

With 95% confidence, the true odds ratio lies between about 1.078 and 1.134. In context, for every additional point scored by the Hornets, the odds of winning are estimated to multiply by a factor between roughly 1.08 and 1.13, an increase of about 8% to 13% per point.
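As an optional sketch, the point estimate of the odds ratio comes from exponentiating the fitted slope, and it falls inside this interval:

```r
# Odds ratio point estimate: exp(slope) = exp(0.09936), about 1.10
exp(coef(Hornet_Mod_PartA)["TeamPoints"])
```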

Part 1, F

#Compute the G-statistic and use it to test the effectiveness of your model
Hornet_Mod_PartA$null.deviance - Hornet_Mod_PartA$deviance
## [1] 85.95858
anova(Hornet_Mod_PartA, test="Chisq")

The G-statistic of the model is 85.96, computed as the difference between the null deviance and the residual deviance. Using anova() with the test="Chisq" argument, we can see that the overall fit of the model is strong. Because the p-value is essentially zero (< 2.2e-16), we have significant evidence that the slope, β1, is not equal to 0.
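The drop-in-deviance p-value can also be computed by hand (a small optional sketch): the G-statistic is compared to a chi-square distribution with 1 degree of freedom, since the model has one predictor.

```r
# G-statistic and its chi-square p-value (df = 1 for a single predictor)
G = Hornet_Mod_PartA$null.deviance - Hornet_Mod_PartA$deviance
pchisq(G, df = 1, lower.tail = FALSE)
```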

Part 2, A

#Choose a single quantitative predictor and construct a logistic regression model
Hornet_Mod_PartB=glm(Win~OpponentPoints,family=binomial,data=Hornets_Data)
summary(Hornet_Mod_PartB)
## 
## Call:
## glm(formula = Win ~ OpponentPoints, family = binomial, data = Hornets_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3450  -0.9868  -0.4943   0.9646   2.2492  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     9.07300    1.29142   7.026 2.13e-12 ***
## OpponentPoints -0.08999    0.01260  -7.143 9.16e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 453.23  on 327  degrees of freedom
## Residual deviance: 385.38  on 326  degrees of freedom
## AIC: 389.38
## 
## Number of Fisher Scoring iterations: 3

Part 2, B

#Plot the raw data and the logistic curve on the same axes
plot(Win~OpponentPoints, data=Hornets_Data)

B0 = coef(Hornet_Mod_PartB)[1]  # estimated intercept
B1 = coef(Hornet_Mod_PartB)[2]  # estimated slope for OpponentPoints

curve(exp(B0+B1*x)/(1+exp(B0+B1*x)),add=TRUE, col="red")

Part 2, C

# Construct an empirical logit plot and comment on the linearity of the data
for (i in 5:15) {
  emplogitplot1(Win ~ OpponentPoints, data = Hornets_Data, ngroups = i, main = paste(i, "Groups"))
}

Based on the empirical logit plots above, the relationship again appears strong and linear, but negative: as the opponent's points increase, the log odds of a Hornets win decrease.

Part 2, D

#Use the summary of your logistic model to perform a hypothesis test to determine if there is significant evidence of a relationship between the response and predictor variable. State your hypotheses and conclusion

summary(Hornet_Mod_PartB)
## 
## Call:
## glm(formula = Win ~ OpponentPoints, family = binomial, data = Hornets_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3450  -0.9868  -0.4943   0.9646   2.2492  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     9.07300    1.29142   7.026 2.13e-12 ***
## OpponentPoints -0.08999    0.01260  -7.143 9.16e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 453.23  on 327  degrees of freedom
## Residual deviance: 385.38  on 326  degrees of freedom
## AIC: 389.38
## 
## Number of Fisher Scoring iterations: 3

Looking at the p-value for OpponentPoints in Hornet_Mod_PartB, we see that it is very small (p = 9.16e-13), which tells us that OpponentPoints is a significant predictor in this model. Because this p-value is < 0.05, we reject the null hypothesis and conclude there is a relationship between OpponentPoints and the odds of winning.

Part 2, E

# Construct a confidence interval for the odds ratio and include a sentence interpreting the interval in the context

exp(confint(Hornet_Mod_PartB))
## Waiting for profiling to be done...
##                     2.5 %       97.5 %
## (Intercept)    765.959706 1.225249e+05
## OpponentPoints   0.890654 9.358635e-01

With 95% confidence, the true odds ratio lies between about 0.891 and 0.936. In context, for every additional point scored by the opposing team, the odds of the Hornets winning are estimated to multiply by a factor between roughly 0.89 and 0.94, a decrease of about 6% to 11% per point.
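As with Model A, the point estimate of the odds ratio (an optional sketch) comes from exponentiating the fitted slope and falls inside this interval:

```r
# Odds ratio point estimate: exp(slope) = exp(-0.08999), about 0.91
exp(coef(Hornet_Mod_PartB)["OpponentPoints"])
```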

Part 2, F

#Compute the G-statistic and use it to test the effectiveness of your model

Hornet_Mod_PartB$null.deviance - Hornet_Mod_PartB$deviance
## [1] 67.844
anova(Hornet_Mod_PartB, test="Chisq")

The G-statistic of the model is 67.84, computed as the difference between the null deviance and the residual deviance. Using anova() with the test="Chisq" argument, we can see that the overall fit of the model is strong. Because the p-value is essentially zero (< 2.2e-16), we have significant evidence that the slope, β1, is not equal to 0.

ASSESSING BEST MODEL

Part H

summary(Hornet_Mod_PartA)
## 
## Call:
## glm(formula = Win ~ TeamPoints, family = binomial, data = Hornets_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9517  -0.9198  -0.3558   0.9926   2.4020  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.37856    1.35162  -7.679 1.61e-14 ***
## TeamPoints    0.09936    0.01302   7.633 2.30e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 453.23  on 327  degrees of freedom
## Residual deviance: 367.27  on 326  degrees of freedom
## AIC: 371.27
## 
## Number of Fisher Scoring iterations: 4
summary(Hornet_Mod_PartB)
## 
## Call:
## glm(formula = Win ~ OpponentPoints, family = binomial, data = Hornets_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3450  -0.9868  -0.4943   0.9646   2.2492  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     9.07300    1.29142   7.026 2.13e-12 ***
## OpponentPoints -0.08999    0.01260  -7.143 9.16e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 453.23  on 327  degrees of freedom
## Residual deviance: 385.38  on 326  degrees of freedom
## AIC: 389.38
## 
## Number of Fisher Scoring iterations: 3
anova(Hornet_Mod_PartA,test="Chisq")
anova(Hornet_Mod_PartB,test="Chisq")

Looking at both the empirical logit plots and the p-values in both models' summaries, the single predictor in each model is highly significant. Additionally, the G-statistics for both models show that the overall fit of each model is good, because the drop-in-deviance p-values are both less than 0.05 (in fact, both < 2.2e-16). The logit plots show a strong, positive, linear relationship between the log odds of winning and TeamPoints in Model A, and a similarly strong, negative, linear relationship between the log odds of winning and OpponentPoints in Model B. Overall, both models are similar and both predict well. I would say that Hornet_Mod_PartA is marginally better: TeamPoints (p = 2.30e-14) has a slightly smaller p-value than OpponentPoints (p = 9.16e-13), and Model A also has the lower residual deviance (367.27 vs. 385.38) and lower AIC (371.27 vs. 389.38).
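The AIC comparison can be made explicit with one call (a small optional sketch); the model with the lower AIC fits better after accounting for model size:

```r
# Compare the two single-predictor models by AIC (lower is better)
AIC(Hornet_Mod_PartA, Hornet_Mod_PartB)
```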